CNDB-12553: ensure that memtable is reclaimed even when notification subscribers throw #1545

jakubzytka

What is the issue

Cassandra doesn't properly handle notification subscribers that throw
and thereby fail a flush. In that case the flush is interrupted (despite
multiple uses of exception-safe code and exception accumulation)
after the sstable creation transaction has committed, but before the
memtable has been reclaimed. As a result, the memtable allocator believes
that more and more memory is in use or being reclaimed, and eventually
stops writes because the memtable pool appears to be out of memory.

What does this PR fix and why was it fixed

This patch changes memtable flushing behaviour so that the memtable
is reclaimed iff it has been removed from the View, regardless
of whether the flush fails or not.
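
For illustration, the invariant described above can be sketched as a small model. This is a hypothetical simplification, not the actual patch: `FlushSketch`, `Allocator`, `Subscriber`, and `liveMemtables` (standing in for the View) are names invented for this example and are not Cassandra's API. The only point it makes is that reclamation is keyed on whether the memtable left the View, not on whether the subscribers succeeded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a flush; none of these names are Cassandra's real API. */
final class FlushSketch
{
    /** Stand-in for the memtable allocator's bookkeeping. */
    static final class Allocator
    {
        long owned;      // bytes still attributed to memtables
        long reclaimed;  // bytes returned after a flush

        void allocate(long bytes) { owned += bytes; }
        void reclaim(long bytes)  { owned -= bytes; reclaimed += bytes; }
    }

    /** Stand-in for a subscriber that is notified about newly added sstables. */
    interface Subscriber
    {
        void onSSTableAdded(String sstable);
    }

    final Allocator allocator = new Allocator();
    final List<Subscriber> subscribers = new ArrayList<>();
    final List<String> liveMemtables = new ArrayList<>(); // stand-in for the View

    void flush(String memtable, long bytes)
    {
        boolean removedFromView = false;
        try
        {
            // Pretend the sstable creation transaction has committed at this point.
            String sstable = memtable + "-sstable";
            removedFromView = liveMemtables.remove(memtable);

            // Any subscriber may throw and fail the flush.
            for (Subscriber subscriber : subscribers)
                subscriber.onSSTableAdded(sstable);
        }
        finally
        {
            // Reclaim if and only if the memtable left the View,
            // whether or not the notification step failed.
            if (removedFromView)
                allocator.reclaim(bytes);
        }
    }
}
```

In this model, a flush whose subscriber throws still propagates the exception to the caller, but the allocator's `owned` counter drops back, so repeated failures do not accumulate phantom memory usage.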

github-actions bot commented Feb 4, 2025

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

@jakubzytka jakubzytka force-pushed the cndb-12553-ensure-memtable-reclaimed-when-notification-subscriber-throws branch from 337bcb0 to 67e28f6 Compare February 5, 2025 09:53
@jakubzytka jakubzytka requested a review from a team February 5, 2025 10:30
@jacek-lewandowski jacek-lewandowski self-requested a review February 5, 2025 14:36
cfs.replaceFlushed(memtable, Collections.emptyList(), Optional.empty());
reclaim(memtable);
return Collections.emptyList();
try
Collaborator

Overall, the patch seems reasonable to me. However, as you said, it is not precisely defined what should happen when replaceFlushed fails. Since we have no answer, perhaps the best way to deal with that would be to shut down the node and, at the same time, make sure that the notification consumer implementation does not throw any exceptions. Otherwise, from the CNDB point of view, is a failure in a notification consumer critical? Can we continue if it happens?

Member

@JeremiahDJordan JeremiahDJordan left a comment

LGTM besides nits from Jacek

subscribers throw

The direct cause of CNDB-12553 is that a CNDB-specific subscriber to
SSTableAddingNotification throws an error, and Cassandra doesn't
handle it properly. In that case the flush is interrupted (despite
multiple uses of exception-safe code and exception accumulation)
after the sstable creation transaction has committed, but before the
memtable has been reclaimed. As a result, the memtable allocator believes
that more and more memory is in use or being reclaimed, and eventually
stops writes because the memtable pool appears to be out of memory.

This patch changes memtable flushing behaviour so that the memtable
is reclaimed iff it has been removed from the View, regardless
of whether the flush fails or not.
@jakubzytka jakubzytka force-pushed the cndb-12553-ensure-memtable-reclaimed-when-notification-subscriber-throws branch from 67e28f6 to d262cb3 Compare February 25, 2025 11:47
@cassci-bot

✔️ Build ds-cassandra-pr-gate/PR-1545 approved by Butler



@blambov

blambov commented Feb 25, 2025

FYI this is one of the failure scenarios of CNDB's SSTable management that could lead to data loss.

@jakubzytka
Author

jakubzytka commented Feb 25, 2025

> FYI this is one of the failure scenarios of CNDB's SSTable management that could lead to data loss.

I disagree. Each storage notification consumer is run independently. Thus, an exception in one notification consumer does not impact our ability to update ETCD in another one.
The data loss scenario is when we are unable to update ETCD because of other reasons.

(IMO this is not an argument to keep ETCD update outside of transaction handling; I just wanted to clarify that the consumers are independent)
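
As an illustration of what independent consumers mean here, the following minimal sketch delivers one notification to every consumer and accumulates failures instead of stopping at the first one. It is a hypothetical model only: `IndependentNotifier` and its methods are invented for this example and are neither CNDB's nor Cassandra's actual notification code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical notifier; the name and shape are assumptions made for illustration. */
final class IndependentNotifier<T>
{
    private final List<Consumer<T>> consumers = new ArrayList<>();

    void subscribe(Consumer<T> consumer)
    {
        consumers.add(consumer);
    }

    /**
     * Delivers the notification to every consumer, even if an earlier one throws.
     * Returns the first failure with later failures attached as suppressed
     * exceptions, or null if every consumer succeeded.
     */
    RuntimeException publish(T notification)
    {
        RuntimeException accumulated = null;
        for (Consumer<T> consumer : consumers)
        {
            try
            {
                consumer.accept(notification);
            }
            catch (RuntimeException e)
            {
                if (accumulated == null)
                    accumulated = e;
                else if (e != accumulated)   // addSuppressed rejects self-suppression
                    accumulated.addSuppressed(e);
            }
        }
        return accumulated;
    }
}
```

With this shape, a consumer that throws does not prevent a later consumer (for example, one that performs a metadata update) from running; the caller decides afterwards what to do with the accumulated failure.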

@jakubzytka jakubzytka merged commit 8d0b97e into main Feb 25, 2025
461 of 473 checks passed
@jakubzytka jakubzytka deleted the cndb-12553-ensure-memtable-reclaimed-when-notification-subscriber-throws branch February 25, 2025 13:41